2020-Jan-13 (updated: 2019-12-11)
ggplottidyverseImportant in science, business, journalism, etc.
Two key roles:
Summary statistics never give the full picture:
ggplot: Grammar of GraphicsMany graphing programs treat data visualization like painting picture: red circle at \(x_1,y_1\), blue square at \(x_2,y_2\), etc.
This is inefficient and very easy to make mistakes.
ggplot: Grammar of GraphicsMany graphing programs treat data visualization like painting picture: red circle at \(x_1,y_1\), blue square at \(x_2,y_2\), etc.
This is inefficient and very easy to make mistakes.
ggplot way: a graph has 3 core elementsA basic line plot:
Add points
Notice that the lines and points have matching colors, but you only need to specify the color mapping once.
All geoms will “inherit” the same aesthetic mappings unless otherwise specified. This helps maintain consistency.
Do not need to have different data sets for all data and summarized data.
Can just use one data set and visualize it in different ways:
This makes exploratory visualization much easier.
ggplot(subset(Orange, Tree != "5"), aes(age, circumference)) +
geom_point() +
stat_summary(fun.y=mean, geom="line")This also helps exploratory visualization.
Very important for visualizing complex data sets. The individual facets are properly sized and aligned, and inherit aesthetics to make comparisons easy.
For combining different plots, use patchwork
library(patchwork)
p1 <- ggplot(esoph, aes(agegp, ncases/ncontrols, color=alcgp)) +
stat_smooth(aes(group=alcgp), se=FALSE) +
scale_color_brewer(palette = "Set1")
p2 <- ggplot(esoph, aes(agegp, ncases/ncontrols, color=tobgp)) +
stat_smooth(aes(group=tobgp), se=FALSE) +
scale_color_brewer(palette = "Set1")
p1 + p2## month day TempAve TempMin
## Min. : 1.000 Min. : 1.00 Min. : 0.6423 Min. :-1.993
## 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 5.6862 1st Qu.: 3.431
## Median : 7.000 Median :16.00 Median : 9.0329 Median : 6.486
## Mean : 6.526 Mean :15.72 Mean : 9.4095 Mean : 6.664
## 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:13.1557 3rd Qu.:10.230
## Max. :12.000 Max. :31.00 Max. :19.2297 Max. :14.720
## TempMax Rain
## Min. : 2.273 Min. : 0.000
## 1st Qu.: 8.340 1st Qu.: 0.000
## Median :12.030 Median : 0.400
## Mean :12.438 Mean : 3.192
## 3rd Qu.:16.620 3rd Qu.: 2.800
## Max. :24.470 Max. :46.800
ggplot(weather_day, aes(month, TempAve)) +
stat_summary(fun.y=mean, geom="line") +
stat_summary(fun.data=mean_se, geom="errorbar")ggplot(weather_day, aes(as.factor(month), TempAve)) +
geom_violin() +
geom_jitter(width=0.1, alpha=0.25) +
geom_smooth(aes(x=month))ggplot(weather_day, aes(month, TempAve)) +
geom_smooth(color="black") +
geom_jitter(width=0.1, alpha=0.5, aes(color=TempAve)) +
theme_bw(base_size=12) +
scale_color_gradient(low="navy", high="red", guide=FALSE) +
labs(x="Month", y="Temperature",
title="Edinburgh temperatures in 2014")tidyverseRaw data are rarely in a convenient format for analysis.
Need to be pre-processed to be ready for analysis.
Approx. 80% of analysis time is spent on getting data ready for analysis.
.Rdata files
save(dat1, dat2, ..., file="FileName.RData")load("FileName.RData")dplyrfilter rows (observations)
select columns (variables)
arrange rows
mutate create new variables
summarize create summaries
group_by create grouped summaries (and mutations and filters)
merge data sets (base R)
join data sets (dplyr)
tidyrTidy data
“Happy families are all alike; every unhappy family is unhappy in its own way.” - Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” - Hadley Wickham
tidyrTidy data
pivot_longerpivot_widerEasy way to do a series of operations:
g(f(x, y), z) is the same as x %>% f(y) %>% g(z)
But easier to read and write: “Take x then do f(y) then do g(z)”
A few interesting and “robust” indicators:
## -- Attaching packages ----------------------------------------- tidyverse 1.3.0 --
## <U+2713> tibble 2.1.3 <U+2713> dplyr 0.8.3
## <U+2713> tidyr 1.0.0 <U+2713> stringr 1.4.0
## <U+2713> readr 1.3.1 <U+2713> forcats 0.4.0
## <U+2713> purrr 0.3.3
## -- Conflicts -------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#Get data from the PHE server
mh_phe <- fingertips_data(IndicatorID = c(848, 90646, 90275, 41001),
AreaTypeID = 102) %>%
select(IndicatorID, IndicatorName,
County = AreaName, Region = ParentName,
Timeperiod, Value) %>%
mutate(Year = as.numeric(str_sub(Timeperiod, 1, 4))) %>%
filter(Region != "England") %>%
mutate(Region = str_remove_all(Region, " region"))## IndicatorID IndicatorName County Region
## Min. : 848 Length:9422 Length:9422 Length:9422
## 1st Qu.:41001 Class :character Class :character Class :character
## Median :41001 Mode :character Mode :character Mode :character
## Mean :44945
## 3rd Qu.:41001
## Max. :90646
##
## Timeperiod Value Year
## Length:9422 Min. : 0.0532 Min. :2001
## Class :character 1st Qu.: 5.6154 1st Qu.:2006
## Mode :character Median : 9.5482 Median :2011
## Mean :12.7162 Mean :2010
## 3rd Qu.:14.2544 3rd Qu.:2014
## Max. :71.1897 Max. :2017
## NA's :619
## 'data.frame': 9422 obs. of 7 variables:
## $ IndicatorID : int 848 848 848 848 848 848 848 848 848 848 ...
## $ IndicatorName: chr "Depression: Recorded prevalence (aged 18+)" "Depression: Recorded prevalence (aged 18+)" "Depression: Recorded prevalence (aged 18+)" "Depression: Recorded prevalence (aged 18+)" ...
## $ County : chr "Hartlepool" "Middlesbrough" "Redcar and Cleveland" "Stockton-on-Tees" ...
## $ Region : chr "North East" "North East" "North East" "North East" ...
## $ Timeperiod : chr "2013/14" "2013/14" "2013/14" "2013/14" ...
## $ Value : num 5.32 5.74 7.86 9.19 7.38 ...
## $ Year : num 2013 2013 2013 2013 2013 ...
mh_phe %>% group_by(IndicatorName, Region) %>%
summarize(Value = mean(Value, na.rm = TRUE)) %>%
pivot_wider(names_from = IndicatorName, values_from = Value)## # A tibble: 9 x 5
## Region `Depression: QOF … `Depression: Rec… `Percentage of ph… `Suicide rate`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 East M… 1.49 8.68 56.9 9.90
## 2 East o… 1.35 8.03 56.6 9.96
## 3 London 1.00 6.01 57.3 9.61
## 4 North … 1.47 9.01 52.9 12.4
## 5 North … 1.65 9.69 53.1 11.6
## 6 South … 1.27 8.05 58.7 10.6
## 7 South … 1.33 8.30 58.6 10.9
## 8 West M… 1.44 8.73 54.1 10.1
## 9 Yorksh… 1.43 8.80 55.2 10.7
mh_phe %>% group_by(IndicatorName, Year) %>%
summarize(Value = mean(Value, na.rm = TRUE)) %>%
pivot_wider(names_from = IndicatorName, values_from = Value) %>%
arrange(Year)## # A tibble: 17 x 5
## Year `Depression: QOF … `Depression: Reco… `Percentage of ph… `Suicide rate`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2001 NA NA NA 11.2
## 2 2002 NA NA NA 11.0
## 3 2003 NA NA NA 10.9
## 4 2004 NA NA NA 10.8
## 5 2005 NA NA NA 10.4
## 6 2006 NA NA NA 10.3
## 7 2007 NA NA NA 10.4
## 8 2008 NA NA NA 10.3
## 9 2009 NA NA NA 10.2
## 10 2010 NA NA NA 10.3
## 11 2011 NA NA NA 10.6
## 12 2012 NA NA 55.7 10.7
## 13 2013 1.04 6.44 55.5 10.8
## 14 2014 1.18 7.24 56.5 10.6
## 15 2015 1.43 8.17 56.5 10.4
## 16 2016 1.51 9.04 NA 10.3
## 17 2017 1.55 9.78 NA NA
read.table, haven::read_sav, readxl::read_excel, load(url(...)), etc.ggplotfilter, group_by, mutate, gather, spread, etc.ggplott.test, lm, lmer, sem, etc.ggplotUse R Markdown for steps 2-7 to produce a fully reproducible analysis report.
(Can also do step 8 and generate a final PDF or Word document of the full report)
Prerequisites
rmarkdown and knitr packagesCreate a markdown file
Excerpt from the R Markdown file for this lecture:
ggplot
tidyverse
RData filesdplyr: filter, select, group_by, mutate, summarize, arrangetidyr: pivot_longer, pivot_wider